arXiv:1501.02844v1 [stat.ML] 12 Jan 2015


SPRITE: A Response Model 
For Multiple Choice Testing 


RYAN NING 

Rice University 
ryan.ning@sparfa.com 

CHRISTOPH STUDER 

Cornell University 
studer@sparfa.com 


ANDREW E. WATERS 

OpenStax

waters@sparfa.com 

RICHARD G. BARANIUK 

Rice University 
richb@sparfa.com 


Item response theory (IRT) models for categorical response data are widely used in the analysis of
educational data, computerized adaptive testing, and psychological surveys. However, most IRT models rely
on both the assumption that categories are strictly ordered and the assumption that this ordering is known
a priori. These assumptions are impractical in many real-world scenarios, such as multiple-choice
exams where the levels of incorrectness for the distractor categories are often unknown. While a number of
results exist on IRT models for unordered categorical data, they tend to have restrictive modeling
assumptions that lead to poor data fitting performance in practice. Furthermore, existing unordered categorical
models have parameters that are difficult to interpret. In this work, we propose a novel methodology for
unordered categorical IRT that we call SPRITE (short for stochastic polytomous response item model)
that: (i) analyzes both ordered and unordered categories, (ii) offers interpretable outputs, and (iii)
provides improved data fitting compared to existing models. We compare SPRITE to existing item response
models and demonstrate its efficacy on both synthetic and real-world educational datasets.


1 Introduction 


1.1 Item Response Theory 

A common task in education is evaluating how well learners in a class have mastered some set of 
competencies. This task is almost universally carried out through some form of testing, typically 
where a student is given a set of questions, and their ability is measured simply by counting the 
number of questions they answer correctly. This simple method of counting the number of
correct responses is called classical test theory (CTT) (Bechger et al., 2003). However, CTT
ignores valuable information in the way respondents interact with each question. For example,
two respondents can have the same number of correct responses but have completely different
areas of mastery, information that cannot be modeled by the simple aggregate score.

Item response theory (IRT), in contrast, models the interaction between each item¹ and
respondent to learn information regarding respondent and item characteristics (Nering and Ostini,
2011). Concretely, IRT explicitly models the probability that a respondent will choose each
multiple choice category (option) within a question. IRT is widely considered to be superior
to classical test theory in both efficiency and accuracy, achieving high precision in measuring
a respondent’s characteristics with a smaller number of questions (Nering and Ostini, 2011).
In recent years, IRT has seen widespread adoption in analyzing surveys, questionnaires, and
standardized tests, such as the Graduate Record Examination (GRE) and Graduate Management
Admission Test (GMAT) (Ware et al., 2000).


1.2 The Problem of Unordered Categories 

Data typically analyzed by IRT may have ordered or unordered categories.² These categories
may be ordered on a scale (such as a survey questionnaire, where respondents are asked to
provide an answer on a scale from one to five) or they can be ordered in more abstract ways, such as
the correctness of a response to a test question. Most IRT models rely on two key assumptions:
1) the categories are strictly ordered and 2) this ordering is known a priori. These assumptions
are impractical in many real-world scenarios. For example, in a multiple-choice testing
question with no strictly ordered categories, there may be a correct category and multiple distractor
categories that are equally incorrect. Furthermore, even with ordered categories, the category
ordering itself may not be known in advance. As a concrete example, consider the following
multiple-choice question: What is the capital of Brazil? A) Sao Paulo, B) Belo Horizonte, C)
Beijing, or D) Brasilia. Category D) is the correct answer; categories A) and B) are not correct
but these cities are both in Brazil and hence, these two categories can be considered to be equally
wrong; category C) is the worst choice since Beijing is not in South America. In this case, since
categories A) and B) are equally incorrect, a strict ordering of the categories does not exist.
Furthermore, even the non-strict ordering of the categories is often not known a priori unless
a domain expert provides this information (a costly procedure). To model such unordered³
categories, we need IRT models designed for unordered data.


1.3 Prior Art 


Some IRT models for unordered data, such as the generalized partial credit model (GPCM)
(Muraki, 1992), have restrictive modeling assumptions that degrade data-fitting performance
in practice. Other IRT models for unordered data, such as the nominal response model (NRM)
(Bock, 1972), produce output parameters that are hard to interpret (Thissen and Steinberg, 1997).
In many applications, the interpretability of the posterior model parameters provides insight into the
underlying characteristics of the data. Thus, having interpretable output parameters is often
critical. We discuss existing IRT models for unordered data in detail in Section 2.2.

An alternative class of models that has not been widely adopted in the IRT community
is the class of Bayesian ordinal models. State-of-the-art Bayesian methods (Johnson and Albert, 1999) for


1 The term “item,” in IRT, is a general term for any response item. In the education and testing domain, the term
“item” corresponds to a test or homework question.

2 The term “categories,” commonly used in the IRT community, refers to the multiple choice category labels
within each question. These categories may be ordered in a meaningful way and should not be confused with
strictly categorical data, where no particular ordering of the categories exists (Agresti, 2002).

3 We use the term “unordered” to refer to categories with no ordering, including categories with partial ordering
and ordered categories with a priori unknown category ordering.

[Figure 1 panels: (a) A set of three Gaussian SPRITE functions or “sprites.” (b) Probabilistic
model underlying SPRITE. (c) SPRITE choice probabilities or item category response functions.
(d) Category probabilities for latent trait $Z'$.]


Figure 1: Illustration of the SPRITE model: The location of the latent-predictor variable $Z$ and
the Gaussian category functions (referred to as “sprites”) induce a probability mass function
determining the likelihood of the category $y$ (out of A, B, and C) a respondent will choose. (a) The
Gaussian functions or sprites associated with each category. (b) The SPRITE likelihood model.
(c) The resulting choice probabilities as a function of $Z$. These curves are called item category
response functions (ICRFs) by the IRT community. (d) The category choice probabilities for a
particular latent-predictor value of $Z'$ indicated by the vertical line in (a) and (c). The flexibility
of SPRITE enables us to model ordered as well as unordered categories.


modeling ordinal data generally rely on the assumption of a discrete set of ordered bins which, 
when combined with a latent predictor variable, induce a probability distribution over the set of 
ordered categories. We refer to this particular model as the ORD (short for ORDinal) model in 
the following. However, the ORD model can only be deployed on datasets with a known, strict 
ordering. To enable inference on unordered datasets, we make a slight modification to the ORD 
model to allow for unordered response data and refer to it as LORD (short for learned ordinal). 
We discuss ORD and LORD in detail in Section 2.3.

1.4 Contributions 

We propose a novel IRT model for unordered categorical data that we dub SPRITE (short for 
stochastic polytomous response item model). For each respondent, SPRITE directly models 
the probability of choosing each category over the respondent’s latent parameter space. An 
illustration of SPRITE is shown in Figure 1. SPRITE produces meaningful category orderings
and enables the analysis of both ordered and unordered categorical response data. In addition, 
SPRITE offers a high level of interpretability and provides statistics on the informativeness 
of questions and categories. Lastly, SPRITE provides (often significantly) better data-fitting 
performance than existing IRT models for unordered data. 

Table 1 demonstrates the superiority of SPRITE for a set of real-world datasets in terms of
prediction performance. The details about the individual datasets are summarized in Table 2.
In our experiments, we compare the data fitting ability of SPRITE against existing unordered
IRT models by looking at predictive performance. We withhold 20% of the observed responses
from the data and impute the missing responses using each model. We compute the prediction
error rates as the number of false predictions over the total number of predictions and see that
SPRITE outperforms the competing models for all considered datasets.

Table 1: Prediction error (the number of false predictions over the total number of predictions)
and standard deviation (in parentheses) of SPRITE, ORD (Johnson and Albert, 1999), NRM (Bock, 1972), and
GPCM (Muraki, 1992) on various datasets. For ORD, the superscript L indicates that we used
a modified version of the standard Bayesian ordinal model, where we learn the category
orderings directly from data, which are unavailable for that particular dataset (see Section 2.3 for the
details). SPRITE obtains the best prediction performance on all datasets.


Description                      SPRITE        (L)ORD          NRM           GPCM

Algebra test                     0.25 (0.01)   0.29 (0.02)     0.31 (0.01)   0.26 (0.01)
Computer engineering course      0.17 (0.01)   0.31 (0.02) L   0.33 (0.01)   0.21 (0.01)
Probability course               0.41 (0.01)   0.68 (0.03) L   0.57 (0.01)   0.62 (0.01)
Signals and systems course       0.29 (0.01)   0.48 (0.02) L   0.41 (0.01)   0.56 (0.01)
Comprehensive university exam    0.53 (0.01)   0.71 (0.01) L   0.63 (0.01)   0.62 (0.01)


1.5 Paper Outline 

This paper is organized as follows. In Section 2, we review IRT modeling and prior art for
unordered categorical IRT. In Section 3, we introduce the SPRITE model. In Section 4, we
develop a Markov Chain Monte-Carlo (MCMC) sampling method for SPRITE. In Section 5,
we present results for both synthetic and real-world data experiments. We conclude in Section 6
with a summary and directions for future research.


2 Existing Statistical Models For Item Response Theory 


We describe our notation and the IRT modeling assumptions in Section 2.1. We present existing
IRT models in Section 2.2 and existing Bayesian ordinal models in Section 2.3.

2.1 IRT Notation and Modeling Assumptions 

Assume that we have a dataset consisting of $N$ respondents (for example, test takers on an
exam) interacting with $Q$ questions (for example, multiple-choice questions in an educational
scenario). The observed data matrix $Y$ consists of all the observed interactions between
respondents and questions, with $Y_{i,j}$ denoting the interaction result between the $i$th respondent and the
$j$th question. We assume that one observes polytomous (i.e., more than two categories per
question) interaction data, i.e., $Y_{i,j} \in \{1, \ldots, M_j\}$, where $M_j$ denotes the number of categories for
question $j$. We allow the number of categories $M_j$ to vary across different questions. In many
practical scenarios, not every interaction $Y_{i,j}$ is observed. Consequently, let $\Omega_{\text{obs}}$ denote the index
set of entries for which we observe data.

We now wish to model the observed outcomes $Y_{i,j}$, $(i,j) \in \Omega_{\text{obs}}$, in a statistically principled way.
We will assume a linear predictor $Z_{i,j}$ that induces a discrete probability distribution over the
set of $M_j$ categories. There are many models available for defining the predictor $Z_{i,j}$, including
linear regression (Gelman et al., 1995; Bishop and Nasrabadi, 2006), low-rank models (Recht
et al., 2010; Zhou et al., 2010; Lan et al., 2014), and cluster-based models (Busse et al., 2007).
In this work, we will confine ourselves to Rasch-type models (Rasch, 1993). We focus on the
Rasch model both for its simplicity and for its applicability to a wide variety of ordinal
data-modeling problems (Rasch, 1993; Pallant and Tennant, 2007; Bond and Fox, 2013).
We note that our proposed models can easily be combined with more advanced predictor
models such as multi-dimensional IRT (MIRT) (Beguin and Glas, 2001) and sparse factor-analysis
techniques (Lan et al., 2014). Put simply, the Rasch model defines a random effect $\theta_i \in \mathbb{R}$ for
all respondents $i = 1, \ldots, N$, as well as a random effect $\alpha_j \in \mathbb{R}$ for all questions $j = 1, \ldots, Q$.
The linear predictor variable $Z_{i,j}$ is then given by $Z_{i,j} = \theta_i - \alpha_j$. In an educational context, $\theta_i$
corresponds to the $i$th learner’s ability and $\alpha_j$ corresponds to the $j$th question’s intrinsic difficulty.
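As a minimal illustration of the Rasch-type predictor described above (ability minus difficulty), the following sketch builds the full matrix of predictors for a handful of respondents and questions. The function name `rasch_predictor` and the numerical values are hypothetical and serve only to make the notation concrete.

```python
def rasch_predictor(theta, alpha):
    """Return Z[i][j] = theta[i] - alpha[j] for every respondent i and question j."""
    return [[t - a for a in alpha] for t in theta]


theta = [0.5, -1.0]        # hypothetical abilities of two respondents
alpha = [0.0, 0.5, -0.5]   # hypothetical difficulties of three questions
Z = rasch_predictor(theta, alpha)
```

A larger entry of `Z` means the respondent's ability exceeds the question's difficulty by a wider margin.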

2.2 Existing IRT Models 

IRT can be viewed as a generalized latent variable modeling technique. While numerous models 
for IRT exist in the literature, only a few of them are suitable for unordered categorical response 
data. In particular, GPCM (Muraki, 1992) and NRM (Bock, 1972) can be used to analyze
unordered categorical response data. We now describe each existing IRT model in detail. 


2.2.1 Generalized Partial Credit Model (GPCM) 


The GPCM is a generalized version of the strictly ordinal partial credit model (Masters, 1982)
that allows partial ordering of the categories (Muraki, 1992). GPCM is constructed from
successive dichotomization of adjacent categories. Under GPCM, the probability that respondent $i$
will choose category $y$ for question $j$ is given by

$$P(Y_{i,j} = y \mid \theta_i, \beta_j, \alpha_j) = \frac{\exp\left(\sum_{l=1}^{y} \beta_j (\theta_i - \alpha_{j,l})\right)}{\sum_{k=1}^{M_j} \exp\left(\sum_{l=1}^{k} \beta_j (\theta_i - \alpha_{j,l})\right)}, \qquad (1)$$


where $\beta_j$ is a single discrimination factor for the $j$th question and $\alpha_j = [\alpha_{j,1}; \ldots; \alpha_{j,M_j}]$
represents threshold values where adjacent categories have equal probability of being chosen. We
refer the reader to (Muraki, 1992) for the derivation of (1). Although GPCM can be used to
analyze unordered categorical responses, the model still tries to impose a strict ordering of the
categories. Intuitively, GPCM says that the probability of choosing category $y$ is proportional
to the probability of successively choosing $y$ adjacent pairs of categories (i.e., for a respondent
to choose category 3, they have to first choose category 2 over category 1, and then choose
category 3 over category 2, and finally choose category 3 over category 4). This construction of
successive binary choices does not allow overlapping of the categories. As a result of this
restrictive modeling assumption, GPCM does not perform well under conditions where the categories
overlap.
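To make the successive-dichotomization construction concrete, the sketch below evaluates GPCM category probabilities as a softmax over cumulative sums of $\beta_j(\theta_i - \alpha_{j,l})$. It assumes the standard GPCM form of Muraki (1992); the function name `gpcm_probs` and the argument layout are illustrative choices, not part of the original model description.

```python
import math


def gpcm_probs(theta, beta, alphas):
    """GPCM category probabilities; alphas[l-1] is the threshold alpha_{j,l}."""
    # Cumulative scores s_k = sum_{l=1}^{k} beta * (theta - alpha_{j,l}).
    scores, total = [], 0.0
    for a in alphas:
        total += beta * (theta - a)
        scores.append(total)
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    return [w / norm for w in weights]
```

Because every category shares the single discrimination factor and the scores are cumulative, raising `theta` always shifts probability mass toward later categories, which is the strict-ordering behavior criticized above.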


2.2.2 Nominal Response Model (NRM) 


The NRM (Bock, 1972) is suitable for modeling categorical response data with no particular 
order. NRM is useful when multiple categories are equally good or the ordering of categories 
is unknown. NRM uses independent exponential functions to model each categorical response. 
For NRM, the probability that respondent i will choose category y for question j is given by 


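$$P(Y_{i,j} = y \mid \theta_i, \beta_j, \alpha_j) = \frac{\exp\left(\beta_{j,y} (\theta_i - \alpha_{j,y})\right)}{\sum_{k=1}^{M_j} \exp\left(\beta_{j,k} (\theta_i - \alpha_{j,k})\right)},$$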


where $\beta_j = [\beta_{j,1}; \ldots; \beta_{j,M_j}]$ is a vector of discrimination factors for the categories in the $j$th
question and $\alpha_j = [\alpha_{j,1}; \ldots; \alpha_{j,M_j}]$ is a vector of intrinsic difficulties for the categories in the
$j$th question. NRM does not explicitly model the order of the categories. Instead, it learns
two parameters $\beta_{j,k}$ and $\alpha_{j,k}$ for each category. The main limitation of NRM is interpretability
(Thissen and Steinberg, 1997). In GPCM, the $\alpha$ values are ordered threshold values for choosing
the next (more correct) category, and practitioners can use the $\alpha$ values to directly understand
the ordering of the categories. In NRM, the concept of thresholds does not exist. NRM learns
two interacting parameters $\alpha$ and $\beta$ (for each category) that jointly determine the ordering of
categories. Unlike the ordered $\alpha$ values in GPCM, relative values of $\alpha$ in NRM do not provide
intuitive ordering of the categories.
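The following sketch evaluates NRM category probabilities as a softmax over per-category scores $\beta_{j,k}(\theta_i - \alpha_{j,k})$, assuming the standard form of Bock (1972). The function name `nrm_probs` is a hypothetical choice for illustration.

```python
import math


def nrm_probs(theta, betas, alphas):
    """NRM category probabilities from per-category discriminations and difficulties."""
    scores = [b * (theta - a) for b, a in zip(betas, alphas)]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    return [w / norm for w in weights]
```

Since each category carries its own pair of parameters, the ranking of categories at a given `theta` depends on both vectors jointly, which illustrates why the relative $\alpha$ values alone do not reveal an ordering.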


2.3 Existing Bayesian Ordinal Models 


Assuming a known and fixed ordering of categories, the standard ordinal model (Johnson and
Albert, 1999) posits a latent random variable $Z'_{i,j}$, $\forall (i,j) \in \Omega_{\text{obs}}$, defined as

$$Z'_{i,j} = Z_{i,j} + \varepsilon = \theta_i - \alpha_j + \varepsilon, \qquad (2)$$


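where $\varepsilon$ is assumed to be a standard normal random variable. The model further requires a set
of ordered bin positions on the $j$th question denoted by $-\infty = \gamma_j^0 < \gamma_j^1 < \cdots < \gamma_j^{M_j} = \infty$,
which map the latent predictor variable $Z'_{i,j}$ into one of the $M_j$ polytomous response categories
as follows:

$$Y_{i,j} = y \quad \text{if} \quad \gamma_j^{y-1} < Z'_{i,j} \le \gamma_j^{y}, \quad \forall y \in \{1, \ldots, M_j\}.$$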

A common constraint imposed on the bin positions is $\gamma_j^1 = 0$, which avoids identifiability
problems where the bin positions could be shifted and scaled arbitrarily (Johnson and Albert, 1999).
With (2), we can express the likelihood of selecting category $y \in \{1, \ldots, M_j\}$ as follows:

$$P(Y_{i,j} = y \mid Z_{i,j}, \gamma_j) = \Phi(\gamma_j^{y} - Z_{i,j}) - \Phi(\gamma_j^{y-1} - Z_{i,j}). \qquad (3)$$


Here, $\Phi(x)$ is the cumulative distribution function of the standard normal distribution, defined as
$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp(-t^2/2)\, dt$. Figure 2 illustrates the ORD model. As noted in the introduction,
in a large number of practical applications the exact ordering of the categories is unknown a
priori. Instead, we are faced with having only a set of $M_j$ labels for question $j$ and must learn
the natural ordering of these labels from data. To be able to analyze unordered categorical
data, we slightly modify ORD and introduce the learned ordinal (LORD) model that fuses the
ORD model with a model on the ordering of the category labels. This modification requires
one to learn a permutation $\pi_j$ that maps the $M_j$ labels into a new set of $M_j$ ordered values.
There are $M_j!$ such permutations, where $M_j$ is the number of categories.

Given a specific permutation $\pi_j$, we can rewrite the generative likelihood of (3) as

$$P(Y_{i,j} = y \mid Z_{i,j}, \gamma_j, \pi_j) = \Phi(\gamma_j^{\pi_j(y)} - Z_{i,j}) - \Phi(\gamma_j^{\pi_j(y) - 1} - Z_{i,j}).$$
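The ORD and LORD likelihoods can be sketched as differences of standard normal CDF values between adjacent bin edges, with LORD first mapping each label through a permutation. The helper names (`std_normal_cdf`, `ord_probs`, `lord_probs`) and the list-based encoding of the permutation are assumptions made for this illustration.

```python
import math


def std_normal_cdf(x):
    """Standard normal CDF written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ord_probs(z, inner_bins):
    """ORD probabilities; inner_bins holds the finite, increasing bin positions."""
    bins = [-math.inf] + list(inner_bins) + [math.inf]
    return [std_normal_cdf(bins[y] - z) - std_normal_cdf(bins[y - 1] - z)
            for y in range(1, len(bins))]


def lord_probs(z, inner_bins, perm):
    """LORD probabilities; perm[y-1] is the ordered position (1..M) of label y."""
    ordered = ord_probs(z, inner_bins)
    return [ordered[position - 1] for position in perm]
```

For example, `lord_probs(0.0, [0.0, 1.0], [3, 1, 2])` assigns label 1 the probability of the highest bin, label 2 that of the lowest bin, and label 3 that of the middle bin.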

The prior distributions on each of the latent parameters of interest are given by

$$\theta_i \sim \mathcal{N}(0, \nu_\theta), \quad \alpha_j \sim \mathcal{N}(0, \nu_\alpha), \quad \gamma_j^{k} \sim \mathcal{N}(0, \nu_\gamma), \quad \pi_j \sim \mathcal{U},$$

where $\nu_\theta$, $\nu_\alpha$, and $\nu_\gamma$ are hyperparameters that define the prior variance for the latent respondent
parameters, latent question parameters, and the bin positions, respectively. $\mathcal{N}$ and $\mathcal{U}$ represent the
Gaussian distribution and uniform distribution, respectively; the uniform prior on $\pi_j$ is taken over
the $M_j!$ permutations of the category labels.

[Figure 2 panels, plotted against the latent predictor $Z$: (a) ORD category density function.
(b) ORD model. (c) Category probabilities at $Z'$.]

Figure 2: Illustration of the ORD model: The location of latent predictor variable $Z$ and the bin
positions $\gamma$ induce a probability mass function determining the probability of category $y$ (out of
A, B, and C) a respondent will choose. (a) $Z'$ determines the mean of the Gaussian function
shown; the dashed lines determine the category bin locations. (b) The ORD likelihood model.
(c) Category probabilities at location $Z'$.


3 The SPRITE Model 


We now introduce our proposed SPRITE model. The models discussed in Section 2.3 combine
the latent predictor $Z_{i,j}$ with the bin positions $\gamma_j$ in order to generate the observed response $Y_{i,j}$.
The SPRITE model, by contrast, does not rely on a set of bins, but rather on distributions over
the categories themselves (see Figure 1 for an illustration of this principle). For each question $j$,
each category $k \in \{1, \ldots, M_j\}$ specifies a Gaussian function with mean $\mu_j^k$ and variance $\nu_j^k$. We
call each category’s Gaussian function a “sprite”. We model the probability that respondent $i$
will select category $y$ of question $j$ given the value of the latent predictor variable $Z_{i,j}$ as follows:

$$P(Y_{i,j} = y \mid Z_{i,j}) = \frac{\mathcal{N}(Z_{i,j} \mid \mu_j^y, \nu_j^y)}{\sum_{k=1}^{M_j} \mathcal{N}(Z_{i,j} \mid \mu_j^k, \nu_j^k)}.$$

As described in Section 2.1, $Z_{i,j} = \theta_i - \alpha_j$. Since, given $\mu_j^k$, the parameter $\alpha_j$ does not offer
any additional information, we will absorb the $\alpha_j$ terms directly into the mean $\mu_j^k$ and use
$Z_{i,j} = Z_i = \theta_i$ as the latent predictor. We therefore re-express the SPRITE likelihood equation using
$Z_i$ as

$$P(Y_{i,j} = y \mid Z_i) = \frac{\mathcal{N}(Z_i \mid \mu_j^y, \nu_j^y)}{\sum_{k=1}^{M_j} \mathcal{N}(Z_i \mid \mu_j^k, \nu_j^k)}. \qquad (4)$$


Figure 1(a) shows the item category density functions or “sprites” of each of the three categories
induced over $Z$. Figure 1(c) shows the item category response functions (ICRFs), which plot the
probability of choosing each category as a function of $Z$.
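The SPRITE choice probabilities can be sketched as Gaussian densities evaluated at the latent predictor and normalized across categories. The version below works in log space so that the normalization stays well defined when the predictor lies far from every sprite; the function name `sprite_probs` and this log-space device are implementation choices made for the illustration.

```python
import math


def sprite_probs(z, means, variances):
    """Category probabilities: each sprite's Gaussian density at z, normalized."""
    log_dens = [-0.5 * (z - m) ** 2 / v - 0.5 * math.log(2.0 * math.pi * v)
                for m, v in zip(means, variances)]
    top = max(log_dens)  # subtract the maximum before exponentiating
    weights = [math.exp(d - top) for d in log_dens]
    norm = sum(weights)
    return [w / norm for w in weights]
```

Evaluating `sprite_probs` over a grid of `z` values traces out the item category response functions of one question.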

The prior distributions on each of the latent parameters of interest are given by

$$\mu_j^k \sim \mathcal{N}(0, \nu_\mu), \quad Z_i \sim \mathcal{N}(\mu_z, \nu_z), \quad \nu_j^k \sim \mathcal{IG}(\alpha_\nu, \beta_\nu), \quad Y_{i,j},\ (i,j) \notin \Omega_{\text{obs}} \sim \mathcal{U}\{1, \ldots, M_j\}, \qquad (5)$$

where $\mathcal{IG}(\alpha, \beta)$ denotes the inverse gamma distribution with shape parameter $\alpha$ and scale
parameter $\beta$, and $\nu_\mu$, $\alpha_\nu$, $\beta_\nu$, $\mu_z$, $\nu_z$ are hyperparameters for the prior distributions of the latent
category mean, category variance, and respondent latent trait. We treat missing observations
$Y_{i,j}$, $(i,j) \notin \Omega_{\text{obs}}$, as latent parameters and use a uniform prior distribution on them. The associated
inference details are given in Section 4.1.

Like all models for IRT used to analyze unordered categorical data, SPRITE can be
susceptible to identifiability issues in the data. For example, one can negate all the learned category
means $\mu_j$ and at the same time negate the inferred respondent latent ability parameters $\theta_i$ without
affecting the model likelihood. To prevent identifiability issues, we fix the mean of one sprite
(typically the sprite whose category corresponds to the correct answer) to zero and its variance
to one. SPRITE’s modeling flexibility allows overlapping categories with no strict ordering.
Furthermore, SPRITE’s category mean and variance parameters offer superior interpretability
compared to existing models.


4 Inference With SPRITE 


We now detail our inference method for SPRITE. We first note that, under the Bayesian setting, 
there exist a number of methods for fitting SPRITE to data. We will rely on Markov Chain 
Monte-Carlo (MCMC) sampling methods (Gelman et al., 1995), which are easy to deploy for 
our model. Unlike methods such as expectation maximization (EM) (Dempster et al., 1977;
Bishop and Nasrabadi, 2006) that produce point estimates, an MCMC-based approach provides
full posterior distributions. 


4.1 MCMC Sampler for SPRITE 


We present a Metropolis-within-Gibbs sampler (Gilks et al., 1995) for SPRITE. The SPRITE
latent variables $Z_i$, $\mu_j$, and $\nu_j$ for $i = 1, \ldots, N$ and $j = 1, \ldots, Q$ are sampled via a
Metropolis-Hastings step at each MCMC iteration. Here, we introduce the vector notation $\mu_j = [\mu_j^1; \ldots; \mu_j^{M_j}]$,
$\nu_j = [\nu_j^1; \ldots; \nu_j^{M_j}]$, and $Z = [Z_1; \ldots; Z_N]$. Furthermore, we treat missing observations as latent
variables and sample them using Gibbs sampling. A summary of the steps used by our MCMC
sampler is as follows. We use the notation $[\cdot]^t$ to represent the state of a parameter at iteration
number $t$. For $t = 1, \ldots, T$, where $T$ denotes the total number of MCMC iterations, we perform
the following steps:


1. Propose new latent traits $[Z_i]^t \sim \mathcal{N}([Z_i]^{t-1}, \nu_z)$ for $i = 1, \ldots, N$.

2. Propose new category means $[\mu_j^k]^t \sim \mathcal{N}([\mu_j^k]^{t-1}, \nu_\mu)$ for $j = 1, \ldots, Q$ and $k = 1, \ldots, M_j$.

3. Propose new category variances $[\nu_j^k]^t \sim \mathcal{IG}(\alpha_\nu, \beta')$, where the updated scale parameter
$\beta' = [\nu_j^k]^{t-1}(\alpha_\nu - 1)$, for $j = 1, \ldots, Q$ and $k = 1, \ldots, M_j$. Note that the mean of
$\mathcal{IG}(\alpha_\nu, \beta')$ is $[\nu_j^k]^{t-1}$.

4. Calculate a Metropolis-Hastings acceptance/rejection probability based on the likelihood
ratio between proposed parameters and the parameters from the previous MCMC step.
The likelihood is given by (4) using $[Y_{i,j}]^{t-1}$, $(i,j) \notin \Omega_{\text{obs}}$, and $Y_{i,j}$, $(i,j) \in \Omega_{\text{obs}}$. The
proposed latent variables $[Z_i]^t$, $[\mu_j^k]^t$, and $[\nu_j^k]^t$ for $i = 1, \ldots, N$, $j = 1, \ldots, Q$, and
$k = 1, \ldots, M_j$ are then jointly accepted or rejected.

5. Propose new prediction values for the missing responses $[Y_{i,j}]^t$, $(i,j) \notin \Omega_{\text{obs}}$, by Gibbs
sampling the probabilities induced by (4) using $[Z_i]^t$, $[\mu_j^k]^t$, and $[\nu_j^k]^t$.

Above, $\nu_z$, $\nu_\mu$, and $\alpha_\nu$ are user-defined tuning parameters.
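The five steps above can be sketched in plain Python as follows. This is an illustrative reading of the steps rather than a reference implementation: the names `sprite_mcmc`, `v_z`, `v_mu`, `a_nu` and their default values are hypothetical, categories are 0-based with `None` marking a missing response, the identifiability constraint on one sprite is omitted for brevity, and the acceptance test uses only the likelihood ratio as worded in step 4 (a complete Metropolis-Hastings ratio would also include prior and proposal-density terms, since the inverse-gamma proposal is not symmetric).

```python
import math
import random


def log_sprite_prob(y, z, means, variances):
    """log P(Y = y | Z = z) for 0-based category y: normalized Gaussian densities."""
    logd = [-0.5 * (z - m) ** 2 / v - 0.5 * math.log(2.0 * math.pi * v)
            for m, v in zip(means, variances)]
    top = max(logd)
    return logd[y] - top - math.log(sum(math.exp(d - top) for d in logd))


def log_likelihood(Y, Z, mu, nu):
    return sum(log_sprite_prob(Y[i][j], Z[i], mu[j], nu[j])
               for i in range(len(Z)) for j in range(len(mu)))


def sample_category(z, means, variances, rng):
    probs = [math.exp(log_sprite_prob(y, z, means, variances))
             for y in range(len(means))]
    return rng.choices(range(len(means)), weights=probs)[0]


def sprite_mcmc(Y_obs, M, T, v_z=0.05, v_mu=0.05, a_nu=50.0, seed=0):
    rng = random.Random(seed)
    N, Q = len(Y_obs), len(Y_obs[0])
    Z = [rng.gauss(0.0, 1.0) for _ in range(N)]
    mu = [[rng.gauss(0.0, 1.0) for _ in range(M)] for _ in range(Q)]
    nu = [[1.0] * M for _ in range(Q)]
    # Missing responses start from a uniform draw, matching their uniform prior.
    Y = [[y if y is not None else rng.randrange(M) for y in row] for row in Y_obs]
    ll = log_likelihood(Y, Z, mu, nu)
    for _ in range(T):
        # Steps 1-2: Gaussian random-walk proposals for traits and category means.
        Z_new = [rng.gauss(z, math.sqrt(v_z)) for z in Z]
        mu_new = [[rng.gauss(m, math.sqrt(v_mu)) for m in row] for row in mu]
        # Step 3: inverse-gamma draw with shape a_nu and scale v * (a_nu - 1),
        # so the proposal mean equals the previous variance v.
        nu_new = [[v * (a_nu - 1.0) / rng.gammavariate(a_nu, 1.0) for v in row]
                  for row in nu]
        # Step 4: joint accept/reject based on the likelihood ratio.
        ll_new = log_likelihood(Y, Z_new, mu_new, nu_new)
        if rng.random() < math.exp(min(0.0, ll_new - ll)):
            Z, mu, nu, ll = Z_new, mu_new, nu_new, ll_new
        # Step 5: Gibbs-sample the missing responses from the current model.
        for i in range(N):
            for j in range(Q):
                if Y_obs[i][j] is None:
                    Y[i][j] = sample_category(Z[i], mu[j], nu[j], rng)
        ll = log_likelihood(Y, Z, mu, nu)
    return Z, mu, nu, Y
```

Because all parameters are accepted or rejected jointly, small proposal variances are needed to keep the acceptance rate usable as the problem size grows.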


4.2 Posterior Inference 

After a suitable burn-in period, the MCMC sampler detailed in Section 4.1 produces samples
that approximate the true posterior distribution of all model parameters. We will make use of the
posterior mean when performing experiments in which we compare to a known ground truth. For
real data experiments with unknown ground-truth latent parameters, we gauge the performance
of our models by measuring predictive accuracy (error metrics are presented in Section 5.2). We
make predictions using fully Bayesian imputation (Kong et al., 1994), where we predict missing
responses using the posterior mode of $Y_{i,j}$, $(i,j) \notin \Omega_{\text{obs}}$.
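As a small sketch of the two posterior summaries used here, the snippet below computes a posterior mean for a continuous parameter and a posterior mode for a missing categorical response from lists of post-burn-in samples. The helper names are hypothetical; ties in the mode are broken in favor of the value sampled first.

```python
from collections import Counter


def posterior_mean(samples):
    """Average of post-burn-in samples of a continuous parameter."""
    return sum(samples) / len(samples)


def posterior_mode(samples):
    """Most frequently sampled category for a missing response."""
    return Counter(samples).most_common(1)[0][0]
```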


5 Experiments 

We first evaluate SPRITE using synthetic data to demonstrate model convergence, identifiability, 
and consistency. Then, we compare the predictive performance of SPRITE to other IRT models 
(detailed in Section 2) using real-world educational datasets.

5.1 Synthetic Data Experiments 

GENERATION OF DATA We first generate the ground truth SPRITE model parameters $Z_i$, $\mu_j$,
and $\nu_j$ for $i = 1, \ldots, N$ and $j = 1, \ldots, Q$. For simplicity, we fix the number of categories per
question to $M_j = M = 5$, $\forall j$. We generate the latent parameters via (5) and the observed data
$Y$ via (4). In this experiment, the graded response matrix $Y$ is assumed to be fully observed.
The hyperparameters are as follows: $\mu_z = 0$, $\nu_z = 1$, $\nu_\mu = 1$, $\alpha_\nu = 1$, and $\beta_\nu = 1$.
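The generation procedure can be sketched as follows: draw latent traits and category means from Gaussians, draw category variances from an inverse gamma distribution (as the scale divided by a gamma draw), and then sample each response from the normalized sprite densities. The function names and the seed argument are hypothetical; the defaults mirror the hyperparameter values listed above.

```python
import math
import random


def sprite_probs(z, means, variances):
    """Normalized Gaussian densities at z, computed in log space for stability."""
    logd = [-0.5 * (z - m) ** 2 / v - 0.5 * math.log(2.0 * math.pi * v)
            for m, v in zip(means, variances)]
    top = max(logd)
    weights = [math.exp(d - top) for d in logd]
    norm = sum(weights)
    return [w / norm for w in weights]


def generate_sprite_data(N, Q, M=5, mu_z=0.0, nu_z=1.0, nu_mu=1.0,
                         alpha_nu=1.0, beta_nu=1.0, seed=0):
    rng = random.Random(seed)
    Z = [rng.gauss(mu_z, math.sqrt(nu_z)) for _ in range(N)]
    mu = [[rng.gauss(0.0, math.sqrt(nu_mu)) for _ in range(M)] for _ in range(Q)]
    # Inverse-gamma(shape, scale) draw: scale divided by a Gamma(shape, 1) draw.
    nu = [[beta_nu / rng.gammavariate(alpha_nu, 1.0) for _ in range(M)]
          for _ in range(Q)]
    # Responses are 1-based category labels, fully observed.
    Y = [[rng.choices(range(1, M + 1),
                      weights=sprite_probs(Z[i], mu[j], nu[j]))[0]
          for j in range(Q)] for i in range(N)]
    return Z, mu, nu, Y
```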


PARAMETER ESTIMATION AND ERROR METRICS We deploy SPRITE as described in
Section 4.1 by initializing all parameters of interest with random values. We use 90,000 MCMC
iterations in the burn-in phase and compute the posterior means for all parameters as described
in Section 4.2 over an additional 10,000 iterations. We compare the learned SPRITE parameters
to the known ground truth model using the following three error metrics


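$$E_Z = \frac{\|\hat{Z} - Z\|_2}{\|Z\|_2}, \qquad E_\mu = \frac{\|\hat{\mu} - \mu\|_2}{\|\mu\|_2}, \qquad E_\nu = \frac{\|\hat{\nu} - \nu\|_2}{\|\nu\|_2}, \qquad (6)$$

where $\hat{Z}$, $\hat{\mu}$, and $\hat{\nu}$ represent model estimated values as computed in Section 4.1 and $Z$, $\mu$, and
$\nu$ represent the known ground-truth values.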
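A relative error of this kind can be computed as sketched below, assuming each metric is the Euclidean norm of the estimation error divided by the Euclidean norm of the ground truth (whether the norms are squared cannot be confirmed from the text, so this form is an assumption). The function name `relative_error` is hypothetical, and the inputs are flattened lists of parameter values.

```python
import math


def relative_error(estimate, truth):
    """Euclidean norm of (estimate - truth) divided by the Euclidean norm of truth."""
    num = math.sqrt(sum((e - t) ** 2 for e, t in zip(estimate, truth)))
    den = math.sqrt(sum(t ** 2 for t in truth))
    return num / den
```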


DISCUSSION Figure 3 displays box-whisker plots for the three error metrics in (6) for various
problem sizes (number of questions and number of respondents). To simplify the presentation
of results, we keep the number of questions and number of respondents the same for all problem
sizes. The low error rates demonstrate SPRITE model identifiability and its convergence to the
ground truth. The low standard deviation values demonstrate SPRITE model stability.
Furthermore, all error metrics decrease as the problem size increases, which implies model consistency.

[Figure 3 panels, each plotted against problem size $N = Q$ (axis ticks at 50, 100, and 150):
(a) Respondent latent trait error $E_Z$. (b) Category mean error $E_\mu$. (c) Category variance error $E_\nu$.]

Figure 3: Synthetic experiment over various problem sizes (number of respondents $N$ and
number of questions $Q$) where $N = Q$, and $M_j = M = 5$ categories. (a) The error of the latent trait
$E_Z$. (b) The error in the means of the categories $E_\mu$. (c) The error of category variance $E_\nu$. All
three error metrics decrease as the problem size (the amount of data) grows.


5.2 Real-World Data Experiments 

We now compare the predictive performance of SPRITE against the NRM, the GPCM, and the 
ORD methods (described in Section 2) using a variety of real-world datasets. We use the LORD
method outlined in Section 2.3 in place of ORD when the category ordering is unknown a priori.


DATASET DETAILS We study five educational datasets. A brief description of the datasets can
be found in Table 2. The “algebra test” dataset is from a secondary level algebra test
administered on Amazon’s Mechanical Turk (Lan et al., 2014). All questions are multiple choice
questions and a domain expert has provided an ordering to the multiple choice categories (according
to the correctness of each category). The datasets “computer engineering course,”
“probability course,” and “signals and systems course” are from college level courses administered on
OpenStax Tutor (OpenStax Tutor, 2014). Each of these datasets contains a number of missing
entries, corresponding to the case where students did not answer all available questions.
Finally, the “comprehensive university exam” dataset contains responses on a university level
comprehensive exam (Vats et al., 2013). There are missing entries in this dataset because
students were penalized less for choosing to skip a question instead of answering incorrectly. No
a priori category ordering is available for any dataset except the “algebra test” dataset,
where a human expert has provided category ordering.

Table 2: Description of datasets. The observed-data column gives the percentage of responses
actually present in each dataset; the remainder are missing responses. $Q$ denotes the number of
questions in each dataset and $N$ denotes the number of respondents.


Description                      Size (Q x N)   Categories   Ordered   Observed data

Algebra test                     34 x 99        5            Yes       100%
Computer engineering course      203 x 82       12           No        97%
Probability course               86 x 49        7            No        67%
Signals and systems course       143 x 44       11           No        64%
Comprehensive university exam    60 x 1567      4            No        71%


EXPERIMENT SETUP We compare the predictive performance of the algorithms by first
puncturing (removing) a portion of the observed data and retaining these values for a test set. We set
the rate of puncturing to be 20%. We then train each model using the remaining observed entries
and make predictions on the test set. Once the models have been fit, we infer the missing entries
as discussed in Section 4.2. The error metric used in all educational datasets is simply the
number of incorrect predictions divided by the total number of predictions made. All experiments
were repeated over 50 random puncturing patterns. We use 90,000 MCMC sampling iterations
for the burn-in period and compute our results over an additional 10,000 iterations.
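The puncturing protocol and the error metric can be sketched as below. The function names, the representation of observed entries as a list of index pairs, and the use of a seed to select one puncturing pattern are assumptions made for this illustration.

```python
import random


def puncture(observed, rate=0.2, seed=0):
    """Randomly split observed (i, j) index pairs into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(observed)
    rng.shuffle(shuffled)
    n_test = round(rate * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


def error_rate(predicted, actual):
    """Number of incorrect predictions divided by the number of predictions."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)
```

Repeating the split with 50 different seeds and averaging `error_rate` over them would correspond to the 50 random puncturing patterns described above.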


RESULTS AND DISCUSSION The predictive performance results of all models are
summarized in Table 1. SPRITE outperforms all other models on all datasets. The “algebra test”
dataset is especially interesting, since a human expert has provided a strict ordering of the
categories. This expert-provided ordering was used by ORD, which requires a priori known
category ordering. SPRITE, on the other hand, learned a category ordering directly from the data
without considering the one provided by the human expert. Compared to the category ordering
provided by the human expert, SPRITE’s learned ordering is more flexible (allowing
overlapping categories). Furthermore, SPRITE’s learned ordering is completely data driven and is not
influenced by the human expert’s subjective opinion, which is often unreliable. SPRITE’s
superior performance in the “algebra test” dataset demonstrates that the SPRITE learned category
ordering explains the data better than the one provided by the human expert.

These experiments show that SPRITE performs well against other IRT models on both
ordered and unordered categorical data. Furthermore, SPRITE often learns category
orderings superior to the ones provided by human experts.


INTERPRETABILITY AND MUTUAL INFORMATION SPRITE’s category parameters $\mu_j^k$ and $\nu_j^k$
provide an intuitive ordering of categories. Furthermore, SPRITE provides valuable statistics
concerning question informativeness. In the context of education, the categories chosen by each
learner provide information about their particular mastery of the material. Similarly, the learners
inform SPRITE about how well each question/category discriminates learners with strong vs.
weak mastery of the material.

In particular, using the statistics provided by SPRITE, we can compute the mutual
information $I(Z; Y_j)$ (measured in bits) between the learners’ latent abilities $Z \in \mathbb{R}$ and the category
choices $Y_j \in \{1, \ldots, M_j\}$ made for each question $j = 1, \ldots, Q$. The mutual information (MI)
is able to reveal the informativeness or discriminative power for a given question $j$. The MI is
defined as follows:

$$I(Z; Y_j) = \sum_{y=1}^{M_j} \int_{-\infty}^{\infty} P(y \mid z) P(z) \log_2 \frac{P(y \mid z)}{P(y)} \, dz. \qquad (7)$$

Here, $P(y \mid z)$ is the likelihood function given by (4), $P(z) = \mathcal{N}(z \mid \mu_z, \nu_z)$ is the Gaussian prior on
latent abilities given by (5), and $P(y) = \int_{-\infty}^{\infty} P(y \mid z) P(z)\, dz$ is a normalization term. The integral
in (7) is difficult to evaluate in closed form. However, it can be evaluated easily and accurately
using numerical integration techniques.

Figure 4: Learned parameters of two questions using SPRITE from the “algebra test” dataset.
The curves represent SPRITE category functions or sprites. The colors of the curves have no
meaning and are only used to aid visual disambiguation of unique sprites. (a) An informative
question with MI = 0.42 bits. (b) A less informative question with MI = 0.08 bits.
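One simple way to carry out the numerical integration is a Riemann sum on a fine grid covering several prior standard deviations, as sketched below. The sketch assumes the standard definition of mutual information between a continuous variable and a discrete one, with the sprite likelihood as $P(y \mid z)$ and a Gaussian prior as $P(z)$; the function names, the grid width, and the number of grid points are illustrative choices.

```python
import math


def sprite_probs(z, means, variances):
    """Normalized Gaussian densities at z, computed in log space for stability."""
    logd = [-0.5 * (z - m) ** 2 / v - 0.5 * math.log(2.0 * math.pi * v)
            for m, v in zip(means, variances)]
    top = max(logd)
    weights = [math.exp(d - top) for d in logd]
    norm = sum(weights)
    return [w / norm for w in weights]


def mutual_information(means, variances, mu_z=0.0, nu_z=1.0,
                       width=8.0, steps=4001):
    """Approximate I(Z; Y_j) in bits with a Riemann sum over a grid of z values."""
    sd = math.sqrt(nu_z)
    h = 2.0 * width * sd / (steps - 1)
    grid = [mu_z - width * sd + k * h for k in range(steps)]
    # Prior mass assigned to each grid cell: P(z) dz.
    pz = [math.exp(-0.5 * (z - mu_z) ** 2 / nu_z)
          / math.sqrt(2.0 * math.pi * nu_z) * h for z in grid]
    pyz = [sprite_probs(z, means, variances) for z in grid]
    M = len(means)
    # Marginal category probabilities P(y) = integral of P(y | z) P(z) dz.
    py = [sum(pyz[k][y] * pz[k] for k in range(steps)) for y in range(M)]
    mi = 0.0
    for k in range(steps):
        for y in range(M):
            p = pyz[k][y]
            if p > 0.0 and py[y] > 0.0:
                mi += pz[k] * p * math.log2(p / py[y])
    return mi
```

Identical sprites give an MI near zero, while two narrow, well-separated sprites on either side of the prior mean give an MI close to one bit.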

Figure 4 demonstrates the efficacy of the MI measure (7) for one informative and one less
informative question in the “algebra test” dataset. The informative question (MI = 0.42 bits)
illustrated in Figure 4(a) reveals that one sprite dominates in the positive $Z$ region. This means
that one category is able to distinguish the learner’s performance very well from the other four
categories. By contrast, the less informative question (MI = 0.08 bits) illustrated in Figure 4(b)
reveals multiple overlapping sprites that show little discriminative power (all sprites are grouped
fairly closely). In other words, the categories in this question fail to discriminate each learner’s
latent understanding. Instructors can use such information to either improve the quality of the
available test questions (by revising certain categories) or to determine a high-quality subset of
questions, which is key for test-size reduction (Vats et al., 2013).


6 Conclusion 

We have developed SPRITE to model both ordered and unordered categorical response data. 
SPRITE improves upon the state-of-the-art IRT models both in interpretability and data fitting. 
Additionally, SPRITE provides valuable statistics regarding questions and categories (such as 
their efficacy and degree of information) that can be used to improve the quality of the test 
questions. Several future directions look promising. First, improvements to the SPRITE sampler 
could potentially improve the efficiency of the SPRITE inference algorithm. Methods such
as variational Bayes (Attias, 1999), expectation maximization (Bishop and Nasrabadi, 2006),
and Metropolis-Hastings Robbins-Monro (Cai, 2010) may sacrifice little in terms of data fitting
performance while providing improvements in computational time. Additionally, alternative
models for the linear predictor $Z$, such as MIRT (Beguin and Glas, 2001) and linear regression
models with either fixed or learned covariates, could easily provide additional improvements in
terms of performance and interpretability.